{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| hide\n",
    "#| eval: false\n",
    "! [ -e /content ] && pip install -Uqq fastai  # upgrade fastai on colab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fastai.basics import *\n",
    "from fastai.callback.all import *\n",
    "from fastai.text.all import *"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "skip_exec: true\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Wikitext data tutorial\n",
    "\n",
    "> Using `Datasets`, `Pipeline`, `TfmdLists` and `Transform` in text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, we explore the mid-level API for data collection in the text application. We will use the bases introduced in the [pets tutorial](http://docs.fast.ai/tutorial.pets.html) so you should be familiar with `Transform`, `Pipeline`, `TfmdLists` and `Datasets` already."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = untar_data(URLs.WIKITEXT_TINY)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset comes with the articles in two csv files, so we read it and concatenate them in one dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train = pd.read_csv(path/'train.csv', header=None)\n",
    "df_valid = pd.read_csv(path/'test.csv', header=None)\n",
    "df_all = pd.concat([df_train, df_valid])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>\\n = 2013 – 14 York City F.C. season = \\n \\n The 2013 – 14 season was the &lt;unk&gt; season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \\n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>\\n = Big Boy ( song ) = \\n \\n \" Big Boy \" &lt;unk&gt; \" I 'm A Big Boy Now \" was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including \" Big Boy \" . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \\n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>\\n = The Remix ( Lady Gaga album ) = \\n \\n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and &lt;unk&gt; composit...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\\n = New Year 's Eve ( Up All Night ) = \\n \\n \" New Year 's Eve \" is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica &lt;unk&gt; and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \\n During Reagan ( Christina Applegate ) and Chris 's ( Will &lt;unk&gt; ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>\\n = Geopyxis carbonaria = \\n \\n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family &lt;unk&gt; . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf &lt;unk&gt; cup , &lt;unk&gt; &lt;unk&gt; cup , or pixie cup . The small , &lt;unk&gt; @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers ....</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         0\n",
       "0   \\n = 2013 – 14 York City F.C. season = \\n \\n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \\n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z...\n",
       "1   \\n = Big Boy ( song ) = \\n \\n \" Big Boy \" <unk> \" I 'm A Big Boy Now \" was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including \" Big Boy \" . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \\n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re...\n",
       "2   \\n = The Remix ( Lady Gaga album ) = \\n \\n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit...\n",
       "3   \\n = New Year 's Eve ( Up All Night ) = \\n \\n \" New Year 's Eve \" is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \\n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch...\n",
       "4   \\n = Geopyxis carbonaria = \\n \\n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers ...."
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could tokenize it based on spaces to compare (as is usually done) but here we'll use the standard fastai tokenizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/jhoward/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray\n",
      "  return array(a, dtype, copy=False, order=order)\n"
     ]
    }
   ],
   "source": [
    "splits = [list(range_of(df_train)), list(range(len(df_train), len(df_all)))]\n",
    "tfms = [attrgetter(\"text\"), Tokenizer.from_df(0), Numericalize()]\n",
    "dsets = Datasets(df_all, [tfms], splits=splits, dl_type=LMDataLoader)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bs,sl = 104,72\n",
    "dls = dsets.dataloaders(bs=bs, seq_len=sl)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>text_</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>xxbos = xxmaj mexico xxmaj city xxmaj metropolitan xxmaj cathedral = \\n▁\\n▁ xxmaj the xxmaj metropolitan xxmaj cathedral of the xxmaj assumption of the xxmaj most xxmaj blessed xxmaj virgin xxmaj mary into xxmaj heaven ( xxmaj spanish : xxunk xxunk de la xxunk de la xxmaj santísima xxunk xxmaj maría a los xxunk ) is the largest cathedral in the xxmaj americas , and seat of the xxmaj roman xxmaj catholic</td>\n",
       "      <td>= xxmaj mexico xxmaj city xxmaj metropolitan xxmaj cathedral = \\n▁\\n▁ xxmaj the xxmaj metropolitan xxmaj cathedral of the xxmaj assumption of the xxmaj most xxmaj blessed xxmaj virgin xxmaj mary into xxmaj heaven ( xxmaj spanish : xxunk xxunk de la xxunk de la xxmaj santísima xxunk xxmaj maría a los xxunk ) is the largest cathedral in the xxmaj americas , and seat of the xxmaj roman xxmaj catholic xxmaj</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>, who had campaigned for a negotiated peace with xxmaj nazi xxmaj germany , was interned by the xxmaj british xxmaj authorities under xxmaj defence xxmaj regulation xxunk , along with most other active fascists in xxmaj britain . xxmaj lady xxmaj mosley was imprisoned a month later . xxmaj max and his brother xxmaj alexander were not included in this internship and as a result were separated from their parents for</td>\n",
       "      <td>who had campaigned for a negotiated peace with xxmaj nazi xxmaj germany , was interned by the xxmaj british xxmaj authorities under xxmaj defence xxmaj regulation xxunk , along with most other active fascists in xxmaj britain . xxmaj lady xxmaj mosley was imprisoned a month later . xxmaj max and his brother xxmaj alexander were not included in this internship and as a result were separated from their parents for the</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>jewish xxmaj question to the xxmaj jewish xxmaj state : xxmaj an xxmaj essay on the xxmaj theory of xxmaj zionism ( thesis ) , xxmaj princeton xxmaj university . \\n▁\\n▁ = = = xxmaj articles and chapters = = = \\n▁\\n▁ \" xxunk and the xxmaj palestine xxmaj question : xxmaj the not - so - strange xxmaj case of xxmaj joan xxmaj peter 's ' xxmaj from xxmaj time xxmaj</td>\n",
       "      <td>xxmaj question to the xxmaj jewish xxmaj state : xxmaj an xxmaj essay on the xxmaj theory of xxmaj zionism ( thesis ) , xxmaj princeton xxmaj university . \\n▁\\n▁ = = = xxmaj articles and chapters = = = \\n▁\\n▁ \" xxunk and the xxmaj palestine xxmaj question : xxmaj the not - so - strange xxmaj case of xxmaj joan xxmaj peter 's ' xxmaj from xxmaj time xxmaj immemorial</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "dls.show_batch(max_n=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "config = awd_lstm_lm_config.copy()\n",
    "config.update({'input_p': 0.6, 'output_p': 0.4, 'weight_p': 0.5, 'embed_p': 0.1, 'hidden_p': 0.2})\n",
    "model = get_language_model(AWD_LSTM, len(dls.vocab), config=config)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "opt_func = partial(Adam, wd=0.1, eps=1e-7)\n",
    "cbs = [MixedPrecision(), GradientClip(0.1)] + rnn_cbs(alpha=2, beta=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), opt_func=opt_func, cbs=cbs, metrics=[accuracy, Perplexity()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>epoch</th>\n",
       "      <th>train_loss</th>\n",
       "      <th>valid_loss</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>perplexity</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>5.503713</td>\n",
       "      <td>5.095897</td>\n",
       "      <td>0.237340</td>\n",
       "      <td>163.350342</td>\n",
       "      <td>02:07</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "learn.fit_one_cycle(1, 5e-3, moms=(0.8,0.7,0.8), div=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#learn.fit_one_cycle(90, 5e-3, moms=(0.8,0.7,0.8), div=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "jupytext": {
   "split_at_heading": true
  },
  "kernelspec": {
   "display_name": "python3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
